MODELS FOR A BEGINNING THEORY OF CRITERION-REFERENCED TESTS



A frequently quoted definition of a criterion-referenced test is this
one given by Glaser and Nitko (1971): "A criterion-referenced test is one 
that is deliberately constructed to yield measurements that are directly 
interpretable in terms of specified performance standards." I endorse this 
definition, particularly in the context of this paper. They go on to say: 
"Performance standards are generally specified by defining a class or domain 
of tasks that should be performed by the individual. Measurements are taken 
on representative samples of tasks drawn from this domain, and such
measurements are referenced directly to this domain for each individual measured."
Defining the class or domain of tasks is the function of an instructional 
objective. A well written, or "good," instructional objective will give a 
reasonably unambiguous definition of the domain it is intended to specify, 
so that one will be able to tell whether or not a given item or task is 
within the domain or outside of it. Note that the requirement is that 
"measurements are taken on representative samples of tasks drawn from . . . 
(the) domain." The samples are not (necessarily) randomly drawn. For many 
significant instructional objectives, the domain of tasks cannot be
enumerated and, indeed, may be infinite in number. The wording of the
instructional objective, though, should be clear enough so that it is possible to
construct tasks or test items that represent the domain with some degree of 
confidence. 

Some writers (e.g., Emrick, 1971; Livingston, 1972) use the phrase
"criterion-referenced test" to indicate a collection of items all of which
are intended to measure the same objective. The definition above allows the
phrase to indicate a collection of items that measure an organized set of
related objectives, giving as many scores as objectives represented in the
test. When the latter is the case, important collateral information may
be available in the test itself to improve the accuracy of the scores for
the represented objectives (see, for example, Roudabush and Green, 1972;
Hambleton and Novick, 1973). Either interpretation may be appropriate
depending upon the intended use of the score(s) or the nature of the
subject under discussion.

An item on a paper and pencil test, when completed by a student, is 
a sample of that student's behavior. Furthermore, the item is a sample 
selected from some universe of all possible behaviors which might have been 
selected to represent or measure some particular domain of behaviors. The 
more limited the domain, the easier it is to select behaviors to represent 
it, and more confidence can be placed in the representativeness of the 
selected behaviors for the domain. If we assume a Platonic truth about a 
student with respect to his or her ability to perform the behaviors described 
by an objective and also assume a level of specificity of the objective such 
that (ideally) if a student can perform correctly one behavior from the 
domain, then he or she can perform them all, then for any given item on a 
test there are four possible outcomes: (1) the student cannot perform the 
behavior and does not get the item correct; (2) the student cannot perform 
the behavior but does get the item correct — a false positive score; (3) the 
student can perform the behavior but does not get the item correct — a false 
negative score; and (4) the student can perform the behavior and does get the 
item correct. If a large number of students were repeatedly tested with items 
sampling the domain of behaviors specified by such an objective, the
proportion of correct responses of individual students should approach a stable
bimodal distribution. The lower mode would give the mean proportion for
students who cannot perform the behavior and the upper mode would give the
mean proportion for students who can perform the behavior.
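
The repeated-testing argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the per-item false positive rate `alpha`, the false negative rate `beta`, the share of masters `p_master`, and the sample sizes are all hypothetical values chosen for illustration, and items are treated as independent trials.

```python
import random
from statistics import mean


def simulate_proportions(n_students=2000, n_items=50, p_master=0.6,
                         alpha=0.15, beta=0.10, seed=0):
    """Return (is_master, proportion_correct) pairs under an all-or-none model.

    alpha: assumed chance that a non-master gets an item correct (false positive).
    beta:  assumed chance that a master misses an item (false negative).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_students):
        is_master = rng.random() < p_master
        p_correct = 1.0 - beta if is_master else alpha
        n_correct = sum(rng.random() < p_correct for _ in range(n_items))
        results.append((is_master, n_correct / n_items))
    return results


results = simulate_proportions()
# Mean proportion correct within each true state; these locate the two modes.
lower_mean = mean(p for is_master, p in results if not is_master)
upper_mean = mean(p for is_master, p in results if is_master)
print(round(lower_mean, 2), round(upper_mean, 2))
```

Under these assumptions the proportions cluster near `alpha` for non-masters and near `1 - beta` for masters, which is the two-mode pattern described above.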

Implicit in this kind of specific objective is the idea that the
student will be able to give a perfect performance every time or else he has
not mastered the objective. In the process of determining mastery, some
pragmatic standard of performance will have to be imposed and a margin for
errors in classification tolerated. These are practical problems of
measurement, however (for this kind of specific all-or-none objective), and
not inherent in the objective as such. We really do want the student to
be able to correctly add whole numbers all of the time and not just 90% of
the time, but that is not to say that he or she will always do so.

Most writers on criterion-referenced testing concern themselves with 
the concept of mastery or non-mastery of objectives, but implicitly or 
explicitly assume some underlying continuum of performance within the domain 
of an objective. The criterion of mastery on this continuum becomes of 
major concern. Hambleton and Novick (1973), for example, are concerned to
more accurately estimate a student's standing with respect to a cut-point.
Hively, Maxwell, Rabehl, Sension, and Lundin (1973) and Millman (1972)
are concerned to estimate the number of items in a domain which a student
"really" can correctly answer. Millman quotes Ebel (1971) as follows:
"... abilities, understandings, and appreciations are in the experience 
of almost everyone, not all-or-none adaptations. They are matters of 
degree. None but the simplest of them can ever be mastered completely by 
anyone [p. 287]" and professes agreement with that position. Undoubtedly 
many abilities, understandings, and appreciations are a matter of degree,
but I believe there are many that are not a matter of degree. They are, in
fact, all-or-none occurrences and they are not necessarily simple or
unimportant. Whether a measure of an objective seems continuous or dichotomous
probably depends upon the specificity of the objective and, in part, on the
nature of the content of the objective. A measure of an objective specifying
a heterogeneous domain may seem to reflect a continuum of ability, but, in
fact, is made up of many dichotomous sub-objectives. Tabulating scores
across individuals will give a distribution of scores that looks continuous.
Further, Ebel is denying sudden insight, the "ah-ha" experience that is,
I would hope, in the experience of almost everyone.
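
The claim that many dichotomous sub-objectives can add up to a score that looks continuous can be sketched numerically. The mastery rates below are hypothetical, each sub-objective is assumed to be measured without error, and the sub-objectives are treated as independent across students; none of this comes from data in the paper.

```python
def score_distribution(mastery_rates):
    """Exact distribution of the number of sub-objectives mastered.

    Each rate is the assumed proportion of students who have mastered one
    all-or-none sub-objective; independence is a simplifying assumption.
    """
    pmf = [1.0]  # with no sub-objectives, the score is 0 with certainty
    for rate in mastery_rates:
        nxt = [0.0] * (len(pmf) + 1)
        for k, p in enumerate(pmf):
            nxt[k] += p * (1.0 - rate)   # sub-objective not mastered
            nxt[k + 1] += p * rate       # sub-objective mastered
        pmf = nxt
    return pmf


rates = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # hypothetical values
pmf = score_distribution(rates)
mode = max(range(len(pmf)), key=pmf.__getitem__)
for score, p in enumerate(pmf):
    print(score, round(p, 3))
```

Although every component is all-or-none, the total score spreads over the whole range with a single interior peak, so a tabulation across students would not reveal the dichotomous structure underneath.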

I am proposing, then, two models of the underlying nature of what is 
measured by a criterion-referenced test, each of which applies in some cases 
and not in others. The first assumes an underlying all-or-none, dichotomous,
"true" score and the second assumes an underlying continuous "true" score.
The word "true" in "true" score was placed in quotation marks because I want
to differentiate it from the usual interpretation it is given. For example,
in mathematics it will generally not be satisfactory to know that a student
at a particular time in a particular situation did, in fact, correctly add
two 3-digit numbers four times out of five. What we really wish to know is
whether or not he or she is able to do the addition consistently over a long
period of time with accuracy; that is, we wish to infer something about the
state of the examinee with respect to his or her newly acquired ability to do
addition. We are still concerned with potentially observable behavior and not
with internal traits, dispositions, or values. Our concern is with a
potential or ability to behave in particular ways, which is one step removed from
direct observation.

Now consider Figure 1, which I have labeled Case 1: a dichotomous
measure of a dichotomous true score. The probability of making an error, where
an error is defined as classifying a person as having mastered the objective
when, in fact, he has not, or as classifying a person as having not mastered
the objective when, in fact, he has, is shown in the figure as the shaded
portions, and is equal to $P(X = 1 \mid T = 0) + P(X = 0 \mid T = 1)$. If the
dichotomous measure is a test item, then this figure represents the theoretical
item characteristic curve.

It is interesting to note that Klein and Cleary (1967) have shown that
for a dichotomous measure of a dichotomous true score (which they call a
Platonic true score), when both measure and true score take values of zero
or one and error for an individual is therefore -1, 0, or +1, the true
score and error are negatively correlated and the error can have a
mean of zero only when the number of false positives equals the number of
false negatives in the sample (which is unlikely). They also show that a
second dichotomous measure of the same true score will have errors
correlated with the errors in the first measure. This, of course, violates some
key assumptions of classical test theory and, among other things, causes
inflated estimates of reliability using classical methods. Werts, Linn,
and Joreskog (1973), using congeneric test theory (Joreskog, 1971), have
shown that if the error score is allowed to take values other than -1, 0,
and +1, then the classical assumptions can be met. Doing so, however,
begins to make the true score look less than Platonic.
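
The two properties attributed to Klein and Cleary can be checked by direct arithmetic on a table of counts. The sketch below is not their derivation; it simply defines error as X - T for 0/1 scores and computes the mean error and the covariance of true score with error. The counts passed in are hypothetical.

```python
def error_stats(n_tn, n_fp, n_fn, n_tp):
    """Mean error and true-score/error covariance for 0/1 scores.

    n_tn: T=0, X=0    n_fp: T=0, X=1 (error +1)
    n_fn: T=1, X=0 (error -1)    n_tp: T=1, X=1
    """
    n = n_tn + n_fp + n_fn + n_tp
    cells = [(0, 0, n_tn), (0, 1, n_fp), (1, 0, n_fn), (1, 1, n_tp)]
    mean_t = sum(t * c for t, x, c in cells) / n
    mean_e = sum((x - t) * c for t, x, c in cells) / n
    cov_te = sum((t - mean_t) * ((x - t) - mean_e) * c for t, x, c in cells) / n
    return mean_e, cov_te


# Hypothetical counts: unequal false positives and false negatives.
mean_e, cov_te = error_stats(40, 10, 5, 45)
# Hypothetical counts: false positives equal to false negatives.
mean_e_bal, cov_te_bal = error_stats(45, 5, 5, 45)
print(mean_e, cov_te, mean_e_bal, cov_te_bal)
```

In both hypothetical tables the covariance is negative, and the mean error is zero only in the table where the two kinds of misclassification are equally frequent.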

Figure 2, which I have labeled Case 2: a pseudo continuous measure of
a dichotomous true score, shows the case where some number, N, of measures
of the true score are used to obtain an observed score, X, for an individual.
In this case, there is a distribution of observed scores for people with a
true score of zero and another distribution of observed scores for people
with a true score of one. The ends of the solid lines are meant to indicate
the mean or mode of the two distributions. In this case, in order to get
an estimate of the true score, a criterion observed score, $X_c$, or cut-point
must be established as indicated. The probability of an error of
classification is then $P(X \ge X_c \mid T = 0) + P(X < X_c \mid T = 1)$. The error is shown
as the shaded area in the figure.
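
To make the Case 2 error expression concrete, the sketch below evaluates it for every possible cut-point. It assumes, purely for illustration, that observed scores are binomial within each true-score group; the per-item success rates (0.20 for non-masters, 0.85 for masters) and the 20-item length are hypothetical choices, not estimates from the paper.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)


def classification_error(cut, n_items, p_nonmaster, p_master):
    """P(X >= cut | T = 0) + P(X < cut | T = 1) under the binomial assumption."""
    false_pos = sum(binom_pmf(k, n_items, p_nonmaster)
                    for k in range(cut, n_items + 1))
    false_neg = sum(binom_pmf(k, n_items, p_master) for k in range(0, cut))
    return false_pos + false_neg


n_items = 20
errors = {cut: classification_error(cut, n_items, 0.20, 0.85)
          for cut in range(0, n_items + 2)}
best_cut = min(errors, key=errors.get)
best_error = errors[best_cut]
print(best_cut, best_error)
```

A cut-point of 0 classifies everyone as a master and a cut-point above the test length classifies no one as a master, so the expression equals 1 at both extremes and is smallest somewhere between the two group means.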

If, in ignorance of the true score, we simply plot the frequency
distribution of the observed score, then we combine the two distributions of
Figure 2 into one and would expect (with the hypothesis of a dichotomous 
true score) that the distribution would be bimodal or, perhaps, U-shaped. 
Such a distribution is shown as the solid line in Figure 3. The 20 item 
criterion test in this figure was constructed to measure a single objective 
from the Prescriptive Mathematics Inventory (Gessel, 1972). The additional 
lines in Figure 3 will be described in a moment. 
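
The expectation of a bimodal combined distribution can be sketched by mixing two score distributions. As an assumption made only for illustration, each group's observed score is taken to be binomial; the 40 percent share of non-masters and the per-item success rates are hypothetical and are not fitted to the data of Figure 3.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)


def mixture_pmf(n_items, share_nonmasters, p_nonmaster, p_master):
    """Observed-score distribution when the true score is ignored."""
    return [share_nonmasters * binom_pmf(k, n_items, p_nonmaster)
            + (1.0 - share_nonmasters) * binom_pmf(k, n_items, p_master)
            for k in range(n_items + 1)]


def interior_modes(pmf):
    """Scores whose probability exceeds that of both neighbouring scores."""
    return [k for k in range(1, len(pmf) - 1)
            if pmf[k] > pmf[k - 1] and pmf[k] > pmf[k + 1]]


pmf = mixture_pmf(20, 0.40, 0.20, 0.85)
modes = interior_modes(pmf)
print(modes)
```

With these hypothetical values the combined distribution has one peak near the non-masters' expected score and another near the masters', separated by a trough in which a cut-point could be placed.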

Suppose that we have two independent measures of the same objective. 
Call one a CRT, scored zero or one, and the other a criterion, also scored 
zero or one. The criterion may be direct observation, teacher ratings or 
another test. We can then form the following table of observed frequencies: 

                       Criterion

                     0        1      Total

    CRT     0      f_00     f_01     f_0.
            1      f_10     f_11     f_1.

          Total    f_.0     f_.1      N



where $f_{00}$ is the number of cases not showing mastery of the objective on
either the CRT or the criterion, $f_{01}$ is the number of cases not showing
mastery on the CRT, but showing mastery on the criterion, and so on. N is
the total number of cases in the sample. For a dichotomous true score, there
is some true number of cases, say $N_0$, who have, in fact, not mastered the
objective and some other number of cases, say $N_1$, who have mastered the
objective. The theoretical table of frequencies is then:

                       Criterion

                     0        1

    CRT     0      N_0       0
            1       0       N_1

          Total    N_0      N_1       N



Now let

    $\alpha_1 = P(X \ge X_c \mid T = 0)$ = the probability that non-masters show
                mastery on the CRT,
    $\alpha_2 = P(X \ge X_c \mid T = 0)$ = the probability that non-masters show
                mastery on the criterion,
    $\beta_1 = P(X < X_c \mid T = 1)$ = the probability that masters show
                non-mastery on the CRT, and
    $\beta_2 = P(X < X_c \mid T = 1)$ = the probability that masters show
                non-mastery on the criterion.
From these definitions and the joint frequency tables, it can be shown that: 



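$$
\begin{aligned}
f_{00} &= N_0 (1 - \alpha_1)(1 - \alpha_2) + N_1 \beta_1 \beta_2 , \\
f_{01} &= N_0 (1 - \alpha_1)\alpha_2 + N_1 \beta_1 (1 - \beta_2) , \\
f_{10} &= N_0 \alpha_1 (1 - \alpha_2) + N_1 (1 - \beta_1) \beta_2 , \\
f_{11} &= N_0 \alpha_1 \alpha_2 + N_1 (1 - \beta_1)(1 - \beta_2) .
\end{aligned}
\tag{1}
$$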
Only three of these equations are independent, since any one of the
frequencies can be obtained by subtracting the sum of the remaining three from N,
the fixed sample size. The system is under-determined since we have only
three equations to solve for the five unknowns: $\alpha_1$, $\alpha_2$, $\beta_1$, $\beta_2$, and $N_0$
(since $N = N_0 + N_1$).
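
The four cell frequencies implied by the two error rates of each measure can be computed directly. The function below is a sketch of that forward calculation, assuming (as the derivation does) that the two measures err independently given the true state; the function name and the numerical inputs are illustrative only.

```python
def expected_frequencies(n0, n1, alpha1, alpha2, beta1, beta2):
    """Cell frequencies (f00, f01, f10, f11); first index CRT, second criterion.

    n0, n1: numbers of true non-masters and true masters.
    alpha1, alpha2: false positive rates of the CRT and of the criterion.
    beta1, beta2: false negative rates of the CRT and of the criterion.
    """
    f00 = n0 * (1 - alpha1) * (1 - alpha2) + n1 * beta1 * beta2
    f01 = n0 * (1 - alpha1) * alpha2 + n1 * beta1 * (1 - beta2)
    f10 = n0 * alpha1 * (1 - alpha2) + n1 * (1 - beta1) * beta2
    f11 = n0 * alpha1 * alpha2 + n1 * (1 - beta1) * (1 - beta2)
    return f00, f01, f10, f11


# Hypothetical inputs with an error-prone criterion.
cells = expected_frequencies(100, 200, 0.25, 0.10, 0.10, 0.05)
# Hypothetical inputs with an error-free criterion.
cells_exact = expected_frequencies(100, 200, 0.25, 0.0, 0.10, 0.0)
print(cells, cells_exact)
```

The four cells always sum to N, which is why only three of the equations carry independent information.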

If we assume that $\alpha_2 = \beta_2 = 0$, that is, that the criterion admits of no
error, then the following solutions obtain since there are now but three
unknowns:
$$
\begin{aligned}
\alpha_1 &= \frac{f_{10}}{f_{00} + f_{10}} , \\
\beta_1 &= \frac{f_{01}}{f_{11} + f_{01}} , \\
N_0 &= f_{00} + f_{10} , \text{ and} \\
N_1 &= N - N_0 = f_{11} + f_{01} .
\end{aligned}
\tag{2}
$$



Consider the following table of observed frequencies: 



                       Criterion

                     0        1      Total

    CRT     0      143       39      182
            1       45      291      336

          Total    188      330      518
In this table, the criterion score is the 20 item test whose distribution
appears in Figure 3 dichotomized at $X_c = 11$ and the CRT score is a single
item from the Prescriptive Mathematics Inventory (PMI) that corresponds to
the same objective. The values of $\alpha_1$, $\beta_1$, $N_0$, and $N_1$ from equations 2 are:

    $\alpha_1$ = .24,
    $\beta_1$ = .11,
    $N_0$ = 188, and
    $N_1$ = 330.



In this case, the probability of making a false positive classification is 
about two times the probability of making a false negative classification. 
In Figure 3, the dashed line gives the distribution of the 20 item criterion 
test for those students who correctly answered the corresponding PMI item
and the line with hash marks gives the distribution for those students who 
incorrectly answered the corresponding PMI item. Taking the criterion test 
as error free, the combined probability of misclassifying a student on the 
basis of the single PMI item is .35 and the probability of correctly
classifying a student is $1 - \alpha_1 - \beta_1 = .65$. The latter may be taken as an index
of reliability for the one item test. Given the situation in which these 
data were collected, it is likely that the number of false positive classi- 
fications is inflated. Over a two week testing period in which the PMI was 
always administered first, it is likely that fatigue effects account for 
some of the false positives. It is also possible that the number of false 
negative classifications is inflated due to learning on the part of some 
students. This last effect is accentuated in other distributions from the
same study. 
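
The error-free-criterion solutions can be applied mechanically to the observed table for the PMI item (143, 39, 45, and 291 cases). The sketch below only restates that arithmetic; the function name is illustrative. The text reports the two error rates as .24 and .11, and the unrounded quotients are 45/188 and 39/330.

```python
def solve_error_free_criterion(f00, f01, f10, f11):
    """Solve for alpha1, beta1, N0, N1 when the criterion is taken as error free.

    First index of each frequency is the CRT result, second is the criterion.
    """
    n0 = f00 + f10          # criterion non-masters
    n1 = f01 + f11          # criterion masters
    alpha1 = f10 / n0       # non-masters who pass the CRT
    beta1 = f01 / n1        # masters who fail the CRT
    return alpha1, beta1, n0, n1


alpha1, beta1, n0, n1 = solve_error_free_criterion(143, 39, 45, 291)
index = 1 - alpha1 - beta1
print(n0, n1, round(alpha1, 3), round(beta1, 3), round(index, 3))
```

The ratio of the two error rates computed this way is close to two, in line with the observation that false positive classifications are about twice as probable as false negative ones for this item.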

Figure 4 shows the distribution of a 15 item criterion test with the
distributions for those students who showed mastery on a five item CRT and
those students who showed non-mastery on the five item CRT. Mastery was
defined as four out of five items correct. These distributions came from
data collected for the PIRAMID project (Project: Individualized Reading
And Mathematics, Inter-District). This project was initiated by a
consortium of California school districts. This objective is specific and has
the U-shaped form indicative of an underlying dichotomous true score. In
this case, taking the criterion test as error free, there is one false
negative classification based on the five item CRT. The probability of
misclassification is .01 and the index of reliability is .99. Needless
to say, such results are quite uncommon.

It should be noted that if three independent measures of the same
objective are available, then all relevant parameters for the three
variables can be computed. There are seven parameters: $\alpha_1$, $\alpha_2$, $\alpha_3$, $\beta_1$, $\beta_2$,
$\beta_3$, and $N_0$, and seven independent equations available from the two by two
by two cube of observed frequencies.
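
The counting argument for three measures can be sketched by building the two by two by two cube from the seven parameters. This is a forward calculation only, assuming the three measures err independently given the true state; it does not solve the system, and all numerical inputs are hypothetical.

```python
from itertools import product


def expected_cube(n0, n1, alphas, betas):
    """Expected frequency for each pattern of 0/1 results on several measures.

    alphas[i]: false positive rate of measure i; betas[i]: false negative rate.
    """
    cube = {}
    for pattern in product((0, 1), repeat=len(alphas)):
        p_nonmaster = 1.0
        p_master = 1.0
        for x, a, b in zip(pattern, alphas, betas):
            p_nonmaster *= a if x == 1 else 1.0 - a
            p_master *= (1.0 - b) if x == 1 else b
        cube[pattern] = n0 * p_nonmaster + n1 * p_master
    return cube


# Hypothetical values for the seven parameters (N1 follows from N and N0).
cube = expected_cube(188, 330, (0.24, 0.10, 0.15), (0.11, 0.05, 0.08))
n_parameters = 3 + 3 + 1
n_independent_cells = len(cube) - 1  # cells sum to N, so one is redundant
print(n_parameters, n_independent_cells)
```

Eight cells that must sum to N leave seven independent frequencies, which matches the seven unknowns.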

Figure 5, which I have labeled Case 3: a dichotomous measure of a
continuous true score, shows a traditional kind of item characteristic curve
(if the dichotomous measure is an item scored 0 or 1) with the addition of
an assumed criterion true score, $\theta$, indicating mastery of the objective. The
probability of making an error of classification in this case is indicated
by the shaded area and is equal to $P(X = 1 \mid T < \theta) + P(X = 0 \mid T \ge \theta)$.
Empirical item characteristic curves, where scores on a pool of items written
to measure one objective are substituted for true score, could be useful in
item selection when a reasonable cut score, $X_c$, has been established. In
this case, characteristic curves which cross the cut score near their center
should be more sensitive to instruction than curves which do not. The
importance of sensitivity to instruction in criterion-referenced test item
selection has been discussed elsewhere (Cox and Vargas, 1966; Roudabush,
1973).

Figure 6, labeled Case 4: a pseudo continuous measure of a continuous
true score, is the situation assumed and most discussed in the literature on
criterion-referenced testing. It requires that a criterion observed score,
$X_c$, be established and that a criterion true score, $\theta$, be assumed. The
probability of making an error of classification is shown by the shaded area
and is equal to $P(X \ge X_c \mid T < \theta) + P(X < X_c \mid T \ge \theta)$. The solid line in
Figure 7 shows the distribution of a nine item objective score which would
seem to fit this model. Also plotted is the distribution for students who
showed mastery and non-mastery on a three item measure of the same objective
where the criterion of mastery was set at two out of the three items correct.
The items for both scores were taken from the tryout data for the Prescriptive
Reading Inventory (1972). Notice that, if the nine item test is taken as an
error free criterion test, there is no cut score that does not result in large
errors of classification. For this reason, in the published PRI we trichotomize
objective scores, leaving a middle ground for "review" between mastery and
non-mastery of each objective in the test.
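
Why no cut score works well in Case 4 can be sketched with a toy model. The assumptions here are mine and are not drawn from the PRI data: the true score T is taken as uniform on (0, 1), the observed score on a nine item test is binomial given T, and the criterion true score of 0.7 is a hypothetical value.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)


def case4_error(cut, n_items, theta, grid=200):
    """P(X >= cut | T < theta) + P(X < cut | T >= theta).

    Assumes T uniform on (0, 1) and X | T = t binomial(n_items, t);
    the uniform distribution is approximated by grid midpoints.
    """
    fp_sum = fn_sum = 0.0
    n_below = n_above = 0
    for i in range(grid):
        t = (i + 0.5) / grid
        p_pass = sum(binom_pmf(k, n_items, t) for k in range(cut, n_items + 1))
        if t < theta:
            fp_sum += p_pass
            n_below += 1
        else:
            fn_sum += 1.0 - p_pass
            n_above += 1
    return fp_sum / n_below + fn_sum / n_above


errors = {cut: case4_error(cut, n_items=9, theta=0.7) for cut in range(0, 11)}
best_cut = min(errors, key=errors.get)
print(best_cut, round(errors[best_cut], 3))
```

Because students with true scores close to the criterion are classified almost at random on a short test, the error stays well above zero at every cut score in this toy model, which is consistent with leaving a middle "review" band rather than forcing a two-way decision.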

In this last case, a step function approach to estimating error given
any particular value of $\theta$, such as that of Reed (undated) or Hambleton and
Novick (1973), is appropriate.
CASE 1: DICHOTOMOUS MEASURE - DICHOTOMOUS TRUE SCORE

$P(e) = P(X = 1 \mid T = 0) + P(X = 0 \mid T = 1)$

Figure 1. A dichotomous measure of a dichotomous true score showing the
probability of making an error of classification based on the dichotomous
measure.



CASE 2: PSEUDO CONTINUOUS MEASURE - DICHOTOMOUS TRUE SCORE

$X_c$ = CRITERION OBSERVED SCORE

$P(e) = P(X \ge X_c \mid T = 0) + P(X < X_c \mid T = 1)$

Figure 2. A pseudo continuous measure of a dichotomous true score showing
the probability of making an error of classification based on the pseudo
continuous measure.
CASE 3: DICHOTOMOUS MEASURE - CONTINUOUS TRUE SCORE

$\theta$ = CRITERION TRUE SCORE

$P(e) = P(X = 1 \mid T < \theta) + P(X = 0 \mid T \ge \theta)$

Figure 5. A dichotomous measure of a continuous true score showing the
probability of making an error of classification based on the dichotomous
measure given criterion true score $\theta$.



CASE 4: PSEUDO CONTINUOUS MEASURE - CONTINUOUS TRUE SCORE

$X_c$ = CRITERION OBSERVED SCORE

$\theta$ = CRITERION TRUE SCORE

$P(e) = P(X \ge X_c \mid T < \theta) + P(X < X_c \mid T \ge \theta)$

Figure 6. A pseudo continuous measure of a continuous true score showing
the probability of making an error of classification based on the pseudo
continuous measure with criterion observed score $X_c$, given criterion true
score $\theta$.